Lexical Knowledge Acquisition from Bilingual Corpora
Abstract
For practical research in natural language processing, it is indispensable to develop a large-scale semantic dictionary for computers. It is especially important to improve the techniques for compiling semantic dictionaries from natural language texts, such as those in existing human dictionaries or in large corpora. However, there are at least two difficulties in analyzing existing texts: the problem of syntactic ambiguities and the problem of polysemy. Our approach to solving these difficulties is to make use of translation examples in two distinct languages that have quite different syntactic structures and word meanings. The reason we took this approach is that in many cases both syntactic and semantic ambiguities are resolved by comparing analyzed results from both languages. In this paper, we propose a method for resolving the syntactic ambiguities of translation examples in bilingual corpora and a method for acquiring lexical knowledge, such as case frames of verbs and attribute sets of nouns.

* The authors would like to thank the editorial staff of Kodansha for permission to use the data of the Japanese-English dictionary, and also thank Dr. Shouichi YOKOYAMA, ETL, and Prof. Hozumi TANAKA and Dr. Takenobu TOKUNAGA, Tokyo Institute of Technology, for providing us with the data of the Japanese-English dictionary. This work is partly supported by the Grants from the Ministry of Education, #03245103.

Proc. of COLING-92, Nantes, Aug. 23-28, 1992, p. 581

1 Introduction

It has become widely accepted that developing a large-scale semantic dictionary is indispensable to future natural language research. In recent years, several research activities for compiling semantic dictionaries for natural language processing have been undertaken. One approach is to compile dictionaries by hand. The Japan Electronic Dictionary Research Institute (EDR) is now compiling conceptual dictionaries [5] by hand with the help of software tools. The Information-technology Promotion Agency (IPA), Japan, has also compiled the IPA Lexicon of the Japanese Language for computers (IPAL) [4]. IPAL has 861 entries for basic Japanese verbs. The Cyc project attempts to assemble a massive knowledge base covering human common-sense knowledge [7]. However, this approach suffers from problems such as a huge amount of manual labor, difficulties in extending the dictionaries, unstable results, and so forth.

Another approach is to compile dictionaries using techniques of lexical knowledge acquisition. One such approach is to extract hierarchical relations or a thesaurus of conceptual items from human dictionaries automatically. Tsurumaru et al. studied the construction of a thesaurus of nominal concepts from noun definitions [13]. Tomiura et al. also extracted superordinate-subordinate relations between verbs from the defining sentences in IPAL [12]. Besides these studies, several other research efforts in lexical knowledge acquisition syntactically analyze the sentences in large corpora and attempt to extract lexical knowledge from statistical data [3] [1]. Most of this work undertakes only shallow analysis of texts and extracts only superficial lexical information.

For the development of techniques for knowledge acquisition from natural language texts, it is very important to improve the latter approach of compiling semantic dictionaries by computer programs. However, there are at least two basic difficulties in this approach.

1. The problem of syntactic ambiguities
When analyzing a sentence, syntactic ambiguities often remain. So it is not easy to obtain correct parsed results automatically.

2. The problem of polysemy
It often happens that one word has several meanings and corresponds to several concepts. So it is not easy to associate one surface word with one correct conceptual item.

Our approach to solving these difficulties is to make use of translation examples in two distinct languages that have quite different syntactic structures and word meanings (such as English and Japanese), and to compare analyzed results from each language. In many cases, the two languages have different types of syntactic ambiguities, and comparison of the syntactic structures of both languages helps to resolve the ambiguities. Also, a pair of bilingually equivalent surface words helps to associate the words with conceptual items, because the intersection of the conceptual items that each surface word has could be considered as one conceptual item [11] [2]. For example, in the case of the translation example given in Example 1, both syntactic and semantic ambiguities are resolved.
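The idea of intersecting the conceptual items of two bilingually equivalent surface words can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method or data of the paper: the word lists, the romanized Japanese entries, the concept labels, and the function names `shared_concepts` and `disambiguate` are all hypothetical, and this is not the paper's Example 1.

```python
# Hypothetical concept inventories. The words and concept labels are
# illustrative assumptions, not data from the paper or any real dictionary.
EN_CONCEPTS = {
    "bank": {"financial_institution", "river_side"},
    "spring": {"season", "coil", "water_source"},
}
JA_CONCEPTS = {
    "ginkou": {"financial_institution"},
    "dote": {"river_side", "embankment"},
    "haru": {"season"},
    "bane": {"coil"},
}


def shared_concepts(en_word, ja_word):
    """Return the concepts that both surface words can denote."""
    return EN_CONCEPTS.get(en_word, set()) & JA_CONCEPTS.get(ja_word, set())


def disambiguate(en_word, ja_word):
    """Return the single shared concept of a translation pair.

    Returns None when the pair has no concept in common or when more
    than one concept survives the intersection.
    """
    common = shared_concepts(en_word, ja_word)
    return next(iter(common)) if len(common) == 1 else None


if __name__ == "__main__":
    for pair in [("bank", "ginkou"), ("bank", "dote"), ("spring", "bane")]:
        print(pair, "->", disambiguate(*pair))
```

In this toy setting, "bank" alone is ambiguous between two concepts, but pairing it with either translation equivalent leaves exactly one candidate. A pair whose intersection is empty or still contains several concepts is left unresolved rather than guessed.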
Similar resources
Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora
Within the framework of translation knowledge acquisition from WWW news sites, this paper studies issues on the effect of cross-language retrieval of relevant texts in bilingual lexicon acquisition from comparable corpora. We experimentally show that it is quite effective to reduce the candidate bilingual term pairs against which bilingual term correspondences are estimated, in terms of both co...
MIPA: Mutual Information Based Paraphrase Acquisition via Bilingual Pivoting
We present a pointwise mutual information (PMI) based approach for formalizing paraphrasability and propose a variant of PMI, called mutual information based paraphrase acquisition (MIPA), for paraphrase acquisition. Our paraphrase acquisition method first acquires lexical paraphrase pairs by bilingual pivoting and then reranks them by PMI and distributional similarity. The complementary nature...
Towards Semi Automatic Construction of a Lexical Ontology for Persian
Lexical ontologies and semantic lexicons are important resources in natural language processing. They are used in various tasks and applications, especially where semantic processing is involved, such as question answering, machine translation, text understanding, information retrieval and extraction, content management, text summarization, knowledge acquisition and semantic search engines. Altho...
Learning Method for Automatic Acquisition of Translation Knowledge
This paper presents a new learning method for automatic acquisition of translation knowledge from parallel corpora. We apply this learning method to automatic extraction of bilingual word pairs from parallel corpora. In general, similarity measures are used to extract bilingual word pairs from parallel corpora. However, similarity measures are insufficient because of the sparse data problem. Th...
Learning bilingual translations from comparable corpora to cross-language information retrieval: hybrid statistics-based and linguistics-based approach
Recent years saw an increased interest in the use and the construction of large corpora. With this increased interest and awareness has come an expansion in the application to knowledge acquisition and bilingual terminology extraction. The present paper will seek to present an approach to bilingual lexicon extraction from non-aligned comparable corpora, combination to linguistics-based pruning a...
Disambiguating bilingual nominal entries against WordNet
One reason why the lexical capabilities of NLP systems have remained weak is because of the labour intensive nature of encoding lexical entries for the lexicon. It has been estimated that the average time needed to construct manually a lexical entry for a Machine Translation system is about 30 minutes [Neff et al. 93]. The automatic acquisition of lexical knowledge is the main field of the rese...